Neural approaches to spoken content embedding
Comparing spoken segments is a central operation in speech processing.
Traditional approaches in this area have favored frame-level dynamic
programming algorithms, such as dynamic time warping, because they require no
supervision, but they are limited in performance and efficiency. As an
alternative, acoustic word embeddings -- fixed-dimensional vector
representations of variable-length spoken word segments -- have begun to be
considered for such tasks as well. However, the current space of
discriminative embedding models and training approaches, and their
application to real-world downstream tasks, remains limited. We start by
considering "single-view" training losses, where the goal is to learn an
acoustic word embedding model that separates same-word and different-word
spoken segment pairs. Then, we consider "multi-view" contrastive losses. In
this setting, acoustic word
embeddings are learned jointly with embeddings of character sequences to
generate acoustically grounded embeddings of written words, or acoustically
grounded word embeddings.
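
To make the multi-view setting concrete, the following is a minimal PyTorch
sketch of one contrastive objective in this family (the names are ours, and
the thesis explores several loss variants; this is only the simplest
hardest-negative triplet form, assuming paired batches of acoustic and
character-sequence embeddings):

    import torch
    import torch.nn.functional as F

    def multiview_triplet_loss(audio_emb, text_emb, margin=0.5):
        # audio_emb: (B, D) acoustic word embeddings from the audio encoder
        # text_emb:  (B, D) character-sequence embeddings; row i is the
        #            written form of the word spoken in audio row i
        audio_emb = F.normalize(audio_emb, dim=-1)
        text_emb = F.normalize(text_emb, dim=-1)
        sim = audio_emb @ text_emb.t()   # (B, B) cosine similarities
        pos = sim.diag()                 # similarity of matched pairs
        # Mask the positives, then take the hardest negative per row.
        eye = torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
        hardest_neg = sim.masked_fill(eye, float("-inf")).max(dim=1).values
        # Each spoken word should end up closer to its own written form
        # than to any other word's, by at least the margin.
        return F.relu(margin + hardest_neg - pos).mean()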
In this thesis, we contribute new discriminative acoustic word embedding
(AWE) and acoustically grounded word embedding (AGWE) approaches based on
recurrent neural networks (RNNs). We improve model training in terms of both
efficiency and performance. We take these developments beyond English to
several low-resource languages and show that multilingual training improves
performance when labeled data is limited. We apply our embedding models, both
monolingual and multilingual, to the downstream tasks of query-by-example
speech search and automatic speech recognition. Finally, we show how our
embedding approaches compare with and complement more recent self-supervised
speech models.
Comment: PhD thesis
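
As a rough illustration of the query-by-example application, a hypothetical
sketch: embed the spoken query and every candidate segment once, then rank
candidates by cosine similarity, replacing per-segment dynamic time warping
with fast vector comparisons. The embed() stand-in below just mean-pools
frame features so the sketch runs end to end; a real system would apply the
trained AWE model.

    import numpy as np

    def embed(segment):
        # Stand-in for a trained AWE model mapping a variable-length
        # (T, D) segment of acoustic features to a fixed-length vector.
        return np.asarray(segment).mean(axis=0)

    def query_by_example(query_segment, candidate_segments, top_k=10):
        q = embed(query_segment)
        q = q / np.linalg.norm(q)
        C = np.stack([embed(s) for s in candidate_segments])
        C = C / np.linalg.norm(C, axis=1, keepdims=True)
        scores = C @ q                       # cosine similarity to the query
        order = np.argsort(-scores)[:top_k]  # best-scoring segments first
        return [(int(i), float(scores[i])) for i in order]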
Visually grounded learning of keyword prediction from untranscribed speech
During language acquisition, infants have the benefit of visual cues to
ground spoken language. Robots similarly have access to audio and visual
sensors. Recent work has shown that images and spoken captions can be mapped
into a meaningful common space, allowing images to be retrieved using speech
and vice versa. In this setting of images paired with untranscribed spoken
captions, we consider whether computer vision systems can be used to obtain
textual labels for the speech. Concretely, we use an image-to-words multi-label
visual classifier to tag images with soft textual labels, and then train a
neural network to map from the speech to these soft targets. We show that the
resulting speech system is able to predict which words occur in an
utterance---acting as a spoken bag-of-words classifier---without seeing any
parallel speech and text. We find that the model often confuses semantically
related words, e.g. "man" and "person", making it even more effective as a
semantic keyword spotter.
Comment: 5 pages, 3 figures, 5 tables; small updates, added link to code;
accepted to Interspeech 2017
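
A minimal sketch of this training setup, with a stand-in architecture rather
than the paper's exact network: the visual classifier's per-word
probabilities for each image serve as soft targets, and the speech network
is trained against them with a binary cross-entropy loss that accepts soft
labels.

    import torch
    import torch.nn as nn

    class SpeechBagOfWords(nn.Module):
        # Maps an utterance of acoustic frames to per-word logits.
        def __init__(self, feat_dim=40, hidden=512, vocab=1000):
            super().__init__()
            self.rnn = nn.GRU(feat_dim, hidden, batch_first=True)
            self.out = nn.Linear(hidden, vocab)

        def forward(self, frames):       # frames: (B, T, feat_dim)
            _, h = self.rnn(frames)      # final hidden state: (1, B, hidden)
            return self.out(h.squeeze(0))

    model = SpeechBagOfWords()
    loss_fn = nn.BCEWithLogitsLoss()     # accepts soft targets in [0, 1]

    def train_step(frames, soft_targets, opt):
        # soft_targets: (B, vocab) word probabilities from the visual
        # classifier applied to the image paired with each spoken caption.
        opt.zero_grad()
        loss = loss_fn(model(frames), soft_targets)
        loss.backward()
        opt.step()
        return loss.item()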
What Do Self-Supervised Speech Models Know About Words?
Many self-supervised speech models (S3Ms) have been introduced over the last
few years, improving performance and data efficiency on various speech tasks.
However, these empirical successes alone do not give a complete picture of what
is learned during pre-training. Recent work has begun analyzing how S3Ms encode
certain properties, such as phonetic and speaker information, but we still lack
a proper understanding of knowledge encoded at the word level and beyond. In
this work, we use lightweight analysis methods to study segment-level
linguistic properties -- word identity, boundaries, pronunciation, syntactic
features, and semantic features -- encoded in S3Ms. We present a comparative
study of layer-wise representations from ten S3Ms and find that (i) the
frame-level representations within each word segment are not all equally
informative, and (ii) the pre-training objective and model size heavily
influence the accessibility and distribution of linguistic information across
layers. We also find that on several tasks -- word discrimination, word
segmentation, and semantic sentence similarity -- S3Ms trained with visual
grounding outperform their speech-only counterparts. Finally, our task-based
analyses demonstrate improved performance on word segmentation and acoustic
word discrimination while using simpler methods than prior work.
Comment: Pre-MIT Press publication version
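
As one concrete example of a lightweight analysis in this spirit (the names
and the pooling choice below are ours, not the paper's exact recipe): freeze
the S3M, mean-pool the frame representations within each word segment at a
chosen layer, and fit a linear probe for a segment-level property; comparing
probe accuracy across layers indicates where that information is most
accessible.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def pooled_segment_features(layer_states, segments):
        # layer_states: (T, D) frame representations from one frozen layer
        # segments: list of (start_frame, end_frame) word boundaries
        return np.stack([layer_states[s:e].mean(axis=0) for s, e in segments])

    def probe_layer(train_states, train_segs, y_train,
                    test_states, test_segs, y_test):
        X_train = pooled_segment_features(train_states, train_segs)
        X_test = pooled_segment_features(test_states, test_segs)
        clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
        return clf.score(X_test, y_test)   # linear accessibility at this layer

Mean pooling is only one design choice here; finding (i) above suggests the
frames within a segment are not equally informative, so where one pools can
itself change the result.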
Exposure to the BPA-Substitute Bisphenol S Causes Unique Alterations of Germline Function
Concerns about the safety of Bisphenol A (BPA), a chemical found in plastics, receipts, food packaging and more, have led to its replacement with substitutes now found in a multitude of consumer products. However, several popular BPA-free alternatives, such as Bisphenol S (BPS), share a high degree of structural similarity with BPA, suggesting that these substitutes may disrupt similar developmental and reproductive pathways. We compared the effects of BPA and BPS on germline and reproductive functions using the genetic model system Caenorhabditis elegans. We found that, similarly to BPA, BPS caused severe reproductive defects including germline apoptosis and embryonic lethality. However, meiotic recombination, targeted gene expression, whole-transcriptome and ontology analyses, as well as ToxCast data mining, all indicate that these effects are partly achieved via mechanisms distinct from BPA's. These findings therefore raise new concerns about the safety of BPA alternatives and the risk associated with human exposure to mixtures.
Bisphenol exposure induces DNA damage checkpoint kinase CHK-1 activation.
(A) Immunostaining of phosphorylated CHK-1 on mid- to late-pachytene nuclei from dissected gonads of worms exposed to vehicle control (0.1% ethanol), 500 μM BPA, 500 μM BPS, or to their mixture (scale bar, 10 μm). (B) Percentage of examined worms with elevated pCHK-1 in each group. Error bars represent SEM. N = 10 worms per trial, three repeats per treatment group. All tests are based on t statistics. **P < 0.01.